
feat: poll and print SLURM job estimated start time while pending#464

Merged
ko3n1g merged 3 commits into main from ko3n1g/feat/print-slurm-start-time on Mar 16, 2026

Conversation


ko3n1g (Contributor) commented Mar 14, 2026

Summary

When a SLURM job is submitted, it may sit in the pending queue for minutes or hours with no feedback. This PR adds a lightweight background daemon thread that polls `squeue --start` every 30 seconds and prints the estimated start time to stdout, stopping automatically once the job leaves the pending queue.

  • `_poll_job_start_time`: new method on `SlurmTunnelScheduler` that runs `squeue --start --noheader -j {job_id} -o '%i|%S|%T'` in a loop, printing `[SLURM] Job {id} - State: PENDING, Estimated start: <time>` until the job starts or the stop event is set
  • `schedule()`: starts a daemon thread after `_save_job_dir`; stops any pre-existing thread for the same `job_id` first (retry/duplicate case)
  • `_cancel_existing()`: signals the polling thread when a job is cancelled
  • `close()`: replaces the `...` stub — signals all polling threads and clears the tracking dicts
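The loop described above can be sketched roughly as follows. This is a hypothetical reconstruction from the PR description, not the actual implementation: `run_squeue` stands in for the scheduler's tunnel command runner, and the function signature, stop-event handling, and return-value shape are all assumptions.

```python
POLL_INTERVAL_S = 30  # interval from the PR description (later replaced by backoff)

def poll_job_start_time(run_squeue, job_id, stop_event, interval_s=POLL_INTERVAL_S):
    """Print the estimated start time until the job leaves PENDING.

    `run_squeue(job_id)` is assumed to execute something like
    `squeue --start --noheader -j {job_id} -o '%i|%S|%T'` over the tunnel
    and return an object with `.stdout` and `.return_code`.
    """
    while not stop_event.is_set():
        result = run_squeue(job_id)
        # Guard against None stdout; treat a non-zero return code as empty,
        # since SLURM may put error text on stdout.
        stdout = (result.stdout or "") if result.return_code == 0 else ""
        lines = stdout.strip().splitlines()
        if lines:
            # Array jobs produce one line per task; print only the first.
            jid, start, state = lines[0].split("|")
            if state != "PENDING":
                break  # job left the pending queue; stop polling
            print(f"[SLURM] Job {jid} - State: {state}, Estimated start: {start}")
        stop_event.wait(interval_s)  # interruptible sleep
```

Using `stop_event.wait()` instead of `time.sleep()` lets `_cancel_existing()` or `close()` wake the thread immediately rather than after up to a full interval.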

Edge cases handled:

  • `stdout=None` guard (`result.stdout or ""`)
  • Non-zero `return_code` treated as empty (SLURM error text in stdout)
  • Array jobs (`12345_1`, `12345_2`) deduplicated — only first line printed per cycle
  • Exception resilience — logs debug and retries after 30s wait
  • Duplicate `job_id` on retry — old thread is stopped before new one starts
  • Old SLURM without `--me` flag — uses `-j {job_id}` only
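The first three edge cases above boil down to one output-sanitizing step, which might look like this. The helper name is hypothetical; per the PR, the real method inlines this logic.

```python
def first_pending_line(stdout, return_code):
    """Return the first usable squeue output line, or "" if there is none.

    - return_code != 0: SLURM error text may appear on stdout, so ignore it.
    - stdout is None: coerce to "" before processing.
    - Array jobs (12345_1, 12345_2, ...): multiple lines, keep only the first
      so each polling cycle prints a single status line.
    """
    if return_code != 0:
        return ""
    lines = (stdout or "").strip().splitlines()
    return lines[0] if lines else ""
```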

Test plan

  • 11 new TDD tests covering all edge cases above (`test/run/torchx_backend/schedulers/test_slurm.py`)
  • 2 existing `test_schedule*` tests updated to patch `_poll_job_start_time` (prevents the polling thread from interfering with `tunnel.run.assert_called_once()`)
  • All 33 tests in the file pass: `uv run -- pytest test/run/torchx_backend/schedulers/test_slurm.py -v`
  • `ruff check` + `ruff format` clean
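The patching pattern mentioned in the test plan can be illustrated with a minimal stand-in. `FakeScheduler` below is hypothetical and only mimics the relevant shape of the real scheduler; the point is that patching out `_poll_job_start_time` keeps the polling thread from adding extra `tunnel.run` calls that would break `assert_called_once()`.

```python
from unittest.mock import MagicMock, patch

class FakeScheduler:
    """Hypothetical stand-in for the real scheduler's call pattern."""

    def __init__(self, tunnel):
        self.tunnel = tunnel

    def _poll_job_start_time(self, job_id):
        # In the real class this spins up a daemon thread that also
        # issues tunnel commands (squeue) in the background.
        self.tunnel.run(f"squeue --start -j {job_id}")

    def schedule(self, job_id):
        self.tunnel.run("sbatch job.sub")   # the submission call under test
        self._poll_job_start_time(job_id)   # would add extra tunnel calls

tunnel = MagicMock()
sched = FakeScheduler(tunnel)
with patch.object(FakeScheduler, "_poll_job_start_time"):
    sched.schedule("42")
tunnel.run.assert_called_once()  # passes: polling was patched out
```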

🤖 Generated with Claude Code

When a SLURM job is submitted and sits in the queue, there is no feedback
about when it is expected to start. This adds a background daemon thread
per job that polls `squeue --start` every 30 seconds and prints the
estimated start time to stdout, stopping automatically once the job
leaves the pending queue.

Key details:
- `_poll_job_start_time`: new method guards against None stdout, non-zero
  return codes, and array-job multi-line output (prints only first line)
- Thread is started in `schedule()` and stopped in `_cancel_existing()`
  and `close()`; duplicate job_id (retry) stops the old thread first
- 11 new TDD tests cover all edge cases from the plan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
chtruong814 previously approved these changes Mar 15, 2026
hemildesai previously approved these changes Mar 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Replace fixed 30s interval with exponential backoff (30s base, 2x factor,
capped at 15min) to reduce unnecessary polling for long-pending jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
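The backoff schedule from the commit message above (30s base, 2x factor, capped at 15 minutes) can be sketched as a one-liner; the function name is illustrative, not from the patch.

```python
BASE_S = 30          # initial polling interval
FACTOR = 2           # exponential growth factor
CAP_S = 15 * 60      # never wait longer than 15 minutes

def backoff_intervals(n):
    """First n polling intervals in seconds under the described schedule."""
    return [min(BASE_S * FACTOR**i, CAP_S) for i in range(n)]
```

A long-pending job would thus be polled at 30s, 60s, 120s, ... until the interval plateaus at 900s.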
@ko3n1g ko3n1g merged commit f68f6f2 into main Mar 16, 2026
24 checks passed


4 participants